
21 research outputs found

Improving the performance of dictionary-based approaches in protein name recognition

Author: Tsujii Jun’ichi
Tsuruoka Yoshimasa
Publication venue: Elsevier Inc.
Publication date: 01/12/2004

Dictionary-based protein name recognition is often a first step in extracting information from biomedical documents because it can provide ID information on recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition mainly caused by short names; (2) low recall due to spelling variations. In this paper, we tackle the former problem using machine learning to filter out false positives and present two alternative methods for alleviating the latter problem of spelling variations. The first is achieved by using approximate string searching, and the second by expanding the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results using the GENIA corpus revealed that filtering using a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure, and dictionary expansion with the variant generator gave a further 1.6% improvement and achieved an F-measure of 66.6%.
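As a reading aid for the abstract above, the idea of approximate string searching against a name dictionary can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes plain Levenshtein edit distance, case-insensitive comparison, and a hypothetical toy dictionary whose names and IDs are invented for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_lookup(candidate, dictionary, max_distance=1):
    """Return (name, id) entries within max_distance edits of candidate."""
    key = candidate.lower()
    return [(name, entry_id) for name, entry_id in dictionary.items()
            if edit_distance(key, name.lower()) <= max_distance]


# Hypothetical toy dictionary; the names and IDs are illustrative only.
TOY_DICTIONARY = {"NF-kappa B": "ID1", "interleukin-2": "ID2"}

# A spelling variant (space instead of hyphen) still matches its entry.
matches = approximate_lookup("NF kappa B", TOY_DICTIONARY)
```

A linear scan like this only suits small dictionaries. Loosening `max_distance` recovers more spelling variants but also admits more false matches, which is the precision problem the abstract addresses with a filtering classifier.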

Elsevier - Publisher Connector

The University of Manchester - Institutional Repository

An analysis of gene/protein associations at PubMed scale

Author: Ohta Tomoko
Pyysalo Sampo
Tsujii Jun’ichi
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available. Results: In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology. Conclusions: We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Maximum Entropy Models with Inequality Constraints: A Case Study on Text Categorization

Author: A. L. Berger
C Tan
J.N Darroch
Jun’ichi Kazama
Jun’ichi Tsujii
S. Della Pietra
S.F Chen
W Newman
Publication venue: Springer Science and Business Media LLC

Crossref

Medie and Info-pubmed: 2010 update

Author: Hw Chun
JD Kim
Jun’ichi Tsujii
K Hirohata
Makoto Miwa
Naoaki Okazaki
Rune Sætre
Sampo Pyysalo
T Ninomiya
Takuya Matsuzaki
Tomoko Ohta
Publication venue: BioMed Central
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

PubMed Central

Mining metabolites: extracting the yeast metabolome from the literature

Author: Chikashi Nobata
CR Batchelor
D Banville
D Broadhurst
D Jiao
DB Kell
Douglas B. Kell
GA Eller
J Brecher
J Finkel
J Townsend
J Wisniewski
J Wren
JD Kim
Jun’ichi Tsujii
K Degtyarenko
KM Hettne
L Goebels
M Hucka
M Kanehisa
M Krallinger
N Okazaki
P Corbett
P Mendes
Paul D. Dobson
PD Dobson
Pedro Mendes
R Hoffmann
R Klinger
S Ananiadou
Sophia Ananiadou
Syed A. Iqbal
X Wang
Y Kano
Y Miyao
Y Sasaki
Y Tsuruoka
Publication venue: Springer US
Publication date: 01/01/2011

Text mining methods have added considerably to our capacity to extract biological knowledge from the literature. Recently the field of systems biology has begun to model and simulate metabolic networks, requiring knowledge of the set of molecules involved. While genomics and proteomics technologies are able to supply the macromolecular parts list, the metabolites are less easily assembled. Most metabolites are known and reported through the scientific literature, rather than through large-scale experimental surveys. Thus it is important to recover them from the literature. Here we present a novel tool to automatically identify metabolite names in the literature, and associate structures where possible, to define the reported yeast metabolome. With ten-fold cross validation on a manually annotated corpus, our recognition tool generates an F-score of 78.49 (precision of 83.02) and demonstrates greater suitability in identifying metabolite names than other existing recognition tools for general chemical molecules. The metabolite recognition tool has been applied to the literature covering an important model organism, the yeast Saccharomyces cerevisiae, to define its reported metabolome. By coupling to ChemSpider, a major chemical database, we have identified structures for much of the reported metabolome and, where structure identification fails, been able to suggest extensions to ChemSpider. Our manually annotated gold-standard data on 296 abstracts are available as supplementary materials. Metabolite names and, where appropriate, structures are also available as supplementary materials.

Crossref

PubMed Central

The University of Manchester - Institutional Repository

Probabilistic CFG with Latent Annotations

Author: Jun’ichi Tsujii
Takuya Matsuzaki
Yusuke Miyao
Publication date: 01/01/2005

This paper defines a generative probabilistic model of parse trees, which we call PCFG-LA. This model is an extension of PCFG in which non-terminal symbols are augmented with latent variables. Fine-grained CFG rules are automatically induced from a parsed corpus by training a PCFG-LA model using an EM-algorithm. Because exact parsing with a PCFG-LA is NP-hard, several approximations are described and empirically compared. In experiments using the Penn WSJ corpus, our automatically trained model gave a performance of 86.6% (F1, sentences ≤ 40 words), which is comparable to that of an unlexicalized PCFG parser created using extensive manual feature selection.
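As a reading aid for the abstract above, the marginalization behind a latent-annotation PCFG can be written as follows. The notation is an assumption chosen for illustration and may differ from the paper's: $T$ is an observed parse tree, $\mathbf{x} = (x_1, \dots, x_m)$ assigns one latent value to each of its $m$ non-terminal nodes, $T[\mathbf{x}]$ is the tree with those annotations attached, $A_1$ is the root symbol, $\pi$ is the probability of an annotated root, $\beta$ is the probability of an annotated rule, and $R(T[\mathbf{x}])$ is the multiset of annotated rules used in $T[\mathbf{x}]$.

```latex
\[
P(T) \;=\; \sum_{\mathbf{x}} P\bigl(T[\mathbf{x}]\bigr)
     \;=\; \sum_{\mathbf{x}} \pi\bigl(A_1[x_1]\bigr)
           \prod_{r \,\in\, R(T[\mathbf{x}])} \beta(r)
\]
```

Because the probability of an unannotated tree is a sum over all latent assignments, the most probable tree cannot be found by maximizing over single annotated derivations, which is consistent with the abstract's remark that exact parsing is intractable and approximations are needed.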

CiteSeerX

Crossref

Multi-Multi-View Learning: Multilingual and Multi-Representation Entity Typing

Author: Chiang David
Hockenmaier Julia
Riloff Ellen
Schütze Hinrich
Tsujii Jun’ichi
Yaghoobzadeh Yadollah
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/10/2018

Knowledge bases (KBs) are paramount in NLP. We employ multiview learning for increasing accuracy and coverage of entity type information in KBs. We rely on two metaviews: language and representation. For language, we consider high-resource and low-resource languages from Wikipedia. For representation, we consider representations based on the context distribution of the entity (i.e., on its embedding), on the entity’s name (i.e., on its surface form) and on its description in Wikipedia. The two metaviews language and representation can be freely combined: each pair of language and representation (e.g., German embedding, English description, Spanish name) is a distinct view. Our experiments on entity typing with fine-grained classes demonstrate the effectiveness of multiview learning. We release MVET, a large multiview – and, in particular, multilingual – entity typing dataset we created. Mono- and multilingual fine-grained entity typing systems can be evaluated on this dataset.

Open Access LMU

Neural Transductive Learning and Beyond: Morphological Generation in the Minimal-Resource Setting

Author: Chiang David
Hockenmaier Julia
Kann Katharina
Riloff Ellen
Schütze Hinrich
Tsujii Jun’ichi
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/10/2018

Neural state-of-the-art sequence-to-sequence (seq2seq) models often do not perform well for small training sets. We address paradigm completion, the morphological task of, given a partial paradigm, generating all missing forms. We propose two new methods for the minimal-resource setting: (i) Paradigm transduction: Since we assume only a few paradigms are available for training, neural seq2seq models are able to capture relationships between paradigm cells, but are tied to the idiosyncrasies of the training set. Paradigm transduction mitigates this problem by exploiting the input subset of inflected forms at test time. (ii) Source selection with high precision (SHIP): Multi-source models which learn to automatically select one or multiple sources to predict a target inflection do not perform well in the minimal-resource setting. SHIP is an alternative to identify a reliable source if training data is limited. On a 52-language benchmark dataset, we outperform the previous state of the art by up to 9.71% absolute accuracy.

Open Access LMU